Checked where there are null values so we can deal with missing data.

Missing Values

A function that replaces NaN with the median of the non-missing values in a column.

Called the replace-NaN-with-median function for every feature that has missing data.
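A minimal sketch of what such a median-imputation helper could look like; the function name and column are hypothetical stand-ins, assuming the data lives in a pandas DataFrame.

```python
import pandas as pd

def replace_nan_with_median(df: pd.DataFrame, col: str) -> None:
    # Fill NaNs in one column with the median of its non-missing values
    df[col] = df[col].fillna(df[col].median())

# Toy example: the NaN in BMI becomes the median of [20, 30] = 25
df = pd.DataFrame({"BMI": [20.0, None, 30.0]})
replace_nan_with_median(df, "BMI")
```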

Dropped all of these columns because they had too much missing data; we would have had to replace roughly 1/5 of the values with the median, which would distort the results.
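The "too much missing data" rule above could be sketched as a threshold on the fraction of missing values per column; the function name and the 20% cutoff are assumptions based on the 1/5 figure mentioned.

```python
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    # Drop columns whose fraction of missing values exceeds `threshold`
    frac_missing = df.isna().mean()
    return df.drop(columns=frac_missing[frac_missing > threshold].index)
```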

Outliers

Chose not to modify any of the outliers because few of them fall outside the range of possibility for these features. Removing the outliers could also affect our results, since we want to see whether some of these factors increase or decrease accuracy. Later, during feature selection, we can drop some of the features with many outliers and see whether they are worth using.

Feature Selection

Created correlation plots for all numerical data. Wherever there is collinearity I wanted to drop that feature so there is less noise for the ML algorithms.

hematocrit, MCH, MCHC, MCV, and PT should not be selected due to high collinearity with other features. (I did not drop RBC or INR, which are collinear with hematocrit and PT respectively; since hematocrit and PT are already being dropped, it would not make sense to drop both members of each pair.)
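Instead of eyeballing the correlation plot, the collinear pairs could also be pulled out programmatically; this helper and its 0.8 cutoff are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    # Return pairs of numeric columns whose absolute Pearson correlation
    # exceeds `threshold` (each pair reported once)
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs
```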

Transformations

Made some histograms to see which columns needed to be transformed. If a column was heavily skewed I tried to transform it; the skewed columns are listed a few cells down.

These columns are heavily skewed and should be transformed: age, BMI, SP O2, Urine output, RDW, Leucocyte, Platelets, PT, INR, NT-proBNP, Creatinine, Urea nitrogen, glucose, Magnesium ion.

The rest of the numerical_cols are (relatively) normally distributed and do not need to be transformed.
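As a complement to inspecting histograms, skew could also be flagged numerically; this helper and its cutoff of 1.0 are assumptions for illustration.

```python
import pandas as pd

def skewed_columns(df: pd.DataFrame, threshold: float = 1.0):
    # Flag numeric columns whose absolute sample skewness exceeds `threshold`
    skew = df.skew(numeric_only=True)
    return list(skew[skew.abs() > threshold].index)
```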

Here I made a list of the columns that needed to be transformed, tested each with a square-root transformation and a log transformation, plotted all of the histograms (original, sqrt, log), and picked whichever was the most normally distributed.
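The "pick whichever is most normal" step above could be automated by choosing the transform whose result has skewness closest to zero; this sketch assumes strictly positive values for the log, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd

def compare_transforms(series: pd.Series) -> str:
    # Return whichever of none / sqrt / log yields skewness closest to zero
    candidates = {
        "none": series,
        "sqrt": np.sqrt(series),
        "log": np.log(series),  # assumes strictly positive values
    }
    return min(candidates, key=lambda k: abs(candidates[k].skew()))
```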

Transformation Choices
age - none
BMI - log
SP O2 - none
Urine output - sqrt
RDW - log
Leucocyte - sqrt
Platelets - sqrt
PT - none
INR - none
NT-proBNP - log
Creatinine - log
Urea nitrogen - log
glucose - log
Magnesium ion - none

Here I made a list of which columns should get which transformation and plugged each one into its respective function.
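One way the column-to-transform mapping could be applied in a single pass; the mapping shown is a hypothetical subset of the choices listed above, not the notebook's actual dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the transformation choices above
transforms = {"BMI": "log", "Urine output": "sqrt", "age": "none"}

def apply_transforms(df: pd.DataFrame, transforms: dict) -> pd.DataFrame:
    # Apply the chosen transform to each column; "none" leaves it unchanged
    out = df.copy()
    for col, kind in transforms.items():
        if kind == "log":
            out[col] = np.log(out[col])
        elif kind == "sqrt":
            out[col] = np.sqrt(out[col])
    return out
```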

Centering and Scaling

These are some functions that center and scale dataframes, based on the slides and class material. Centering subtracts the column mean from every value in the column; scaling then divides by the column's standard deviation.
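The two steps described above amount to standardization, which could be written in one line with pandas; the function name is a placeholder for the notebook's own helpers.

```python
import pandas as pd

def center_and_scale(df: pd.DataFrame) -> pd.DataFrame:
    # Subtract each column's mean, then divide by its standard deviation
    return (df - df.mean()) / df.std()
```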

Modelling and Accuracy

Made two dataframes: one with all of the features that would not create noise, used to predict the values of the second dataframe (the outcome).

Did a train-test split as shown in the class examples. Also made a function that prints the accuracy and AUC scores for each selected ML algorithm.
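A sketch of such a scoring helper, assuming scikit-learn classifiers with `predict_proba`; the function name and the toy data are illustrative, not the notebook's actual code.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def report_scores(model, X_train, X_test, y_train, y_test):
    # Fit the model, then report test-set accuracy and AUC
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    acc = accuracy_score(y_test, preds)
    auc = roc_auc_score(y_test, probs)
    print(f"{type(model).__name__}: accuracy={acc:.3f}, AUC={auc:.3f}")
    return acc, auc
```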

Chose to use a decision tree first because it handles noisy data and outliers well. Since I did not deal with outliers, this seemed like a good choice. It also works for classification problems, which is what we have here.

Chose a random forest model second because it has high accuracy and provides reliable feature-importance estimates. One drawback of this model is that it takes a while to make predictions; however, since we are dealing with static data, prediction time does not matter much.
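A minimal sketch of fitting both models with scikit-learn, using toy generated data as a stand-in for the preprocessed feature dataframe; the feature-importance line shows the estimate the random forest paragraph refers to.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; the real notebook would use the preprocessed features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Random forests expose per-feature importance estimates (they sum to 1)
importances = forest.feature_importances_
```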